Unable to use thrust::sort_by_key() in kernel with stack-allocated arrays

I’m on Fedora with CUDA 11.3 and an RTX 2070 Max-Q, and I’m writing a kernel that looks like this:

#include <thrust/sort.h>
#include <thrust/execution_policy.h>

__global__ void foo()
{
    double a[123];
    int b[123];

    // ...

    thrust::sort_by_key(thrust::device, a, a + 123, b);
}

My understanding is that specifying thrust::device allows the function to be called from inside a kernel, but I get error 717, “operation not supported on global/shared address space”. Is there a better way to do this?

I don’t have any problem with what you have shown here, on CUDA 11.1:

$ cat t53.cu
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <cstdio>

__global__ void foo()
{
  double a[123];
  int b[123];
  for (int i = 0; i < 123; i++) {
    b[i] = 123-i;
    a[i] = 123-i;}
  thrust::sort_by_key(thrust::device, a, a+123, b);
  for (int i = 0; i < 10; i++) printf("b[%d] = %d\n", i, b[i]);
}

int main(){

  foo<<<1,1>>>();
  cudaDeviceSynchronize();}

$ nvcc -o t53 t53.cu
$ cuda-memcheck ./t53
========= CUDA-MEMCHECK
b[0] = 1
b[1] = 2
b[2] = 3
b[3] = 4
b[4] = 5
b[5] = 6
b[6] = 7
b[7] = 8
b[8] = 9
b[9] = 10
========= ERROR SUMMARY: 0 errors
$

For this particular use case, I would say that thrust::seq is a more sensible execution policy than thrust::device, but according to my test above, thrust::device behaves similarly in this case.
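
For reference, the kernel with thrust::seq would look something like this (a minimal sketch along the lines of my test above, not your actual code):

#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <cstdio>

__global__ void foo()
{
  double a[123];
  int b[123];
  for (int i = 0; i < 123; i++) {
    b[i] = 123 - i;
    a[i] = 123 - i;
  }
  // thrust::seq always runs sequentially in the calling thread,
  // so it can operate directly on the thread's stack-allocated arrays
  thrust::sort_by_key(thrust::seq, a, a + 123, b);
  for (int i = 0; i < 10; i++) printf("b[%d] = %d\n", i, b[i]);
}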

Thanks a lot for the working example. The code you provided worked perfectly fine. I thought my problem would boil down to something similar, but apparently not. I hadn’t thought about thrust::seq, and you’re right, that’s probably what I was looking for. However, I noticed something pretty strange:

========= Invalid __global__ read of size 8
=========     at 0x00000100 in void cub::DeviceRadixSortSingleTileKernel<cub::DeviceRadixSortPolicy<double, int, int>::Policy800, bool=0, double, int, int>(int const *, cub::DeviceRadixSortSingleTileKernel<cub::DeviceRadixSortPolicy<double, int, int>::Policy800, bool=0, double, int, int>*, cub::DeviceRadixSortPolicy<double, int, int>::Policy800 const *, cub::DeviceRadixSortSingleTileKernel<cub::DeviceRadixSortPolicy<double, int, int>::Policy800, bool=0, double, int, int>**, bool=0, int, int)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x7f97f3fffb40 is out of bounds

In the actual code this is triggered by thrust::sort_by_key(), but only when I set the execution policy to thrust::device. When I use thrust::seq everything is fine and I don’t get the error anymore, so I should be okay, but I didn’t think it was possible for an error like this to disappear just by changing the execution policy.

thrust::device, when used in device code, has the possibility to invoke CUDA Dynamic Parallelism (CDP), on a supported GPU with a supported build environment (e.g. with relocatable device code and device linking).

If you are building that way, you may run into an issue, because local memory cannot be a data source for a CDP child kernel. In that case I would expect an error, perhaps like the one you are reporting. But there is no way to tell conclusively based on scant information and a code snippet.
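
If you actually wanted the CDP path, one possible workaround (a sketch only, not tested against your code) is to keep the data out of local memory, for example by using the in-kernel heap, which lives in global memory and is therefore visible to a child kernel:

#include <thrust/sort.h>
#include <thrust/execution_policy.h>

__global__ void foo()
{
  // in-kernel new/malloc allocates from the device heap, which is in
  // global memory, unlike stack-allocated arrays, which live in local memory
  double *a = new double[123];
  int    *b = new int[123];

  for (int i = 0; i < 123; i++) {
    b[i] = 123 - i;
    a[i] = 123 - i;
  }

  thrust::sort_by_key(thrust::device, a, a + 123, b);

  // ... use the sorted data ...

  delete [] a;
  delete [] b;
}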

thrust::seq has no such possibility to invoke CDP. And furthermore, the build command I used above does not provide a proper build environment for CDP, so in my case thrust::device cannot invoke CDP either.
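
For comparison, a CDP-capable build would need relocatable device code and the device runtime library, something along these lines (the arch flag here is just an example for a Turing GPU):

$ nvcc -arch=sm_75 -rdc=true -o t53 t53.cu -lcudadevrt

Without those options, thrust::device called from device code cannot launch a child kernel and effectively degrades to a sequential, in-thread sort, which is why my example above ran cleanly.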

I did some further testing and you’re right: I’m building with CMake with CDP enabled (relocatable device code / separable compilation), so that explains why the standalone example works fine while my real code does not. The fact that thrust::device falls back to sequential execution when CDP isn’t available, and that local memory cannot be a data source for a CDP kernel when it is, explains a lot of the inconsistent results I was getting. Thanks for the help, this cleared things up a lot!