Unable to use thrust::sort_by_key() in kernel with stack-allocated arrays

I’m on Fedora with CUDA 11.3 and an RTX 2070 Max-Q, and I’m writing a kernel that looks like this:

#include <thrust/sort.h>
#include <thrust/execution_policy.h>

__global__ void foo()
{
    double a[123];
    int b[123];

    // ...

    thrust::sort_by_key(thrust::device, a, a + 123, b);
}

My understanding is that specifying thrust::device allows the function to be called from inside a kernel, but I get error 717, “operation not supported on global/shared address space”. Is there a better way to do this?

I don’t have any problem with what you have shown here, on CUDA 11.1:

$ cat t53.cu
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <cstdio>

__global__ void foo()
{
  double a[123];
  int b[123];
  for (int i = 0; i < 123; i++) {
    b[i] = 123-i;
    a[i] = 123-i;}
  thrust::sort_by_key(thrust::device, a, a+123, b);
  for (int i = 0; i < 10; i++) printf("b[%d] = %d\n", i, b[i]);
}

int main(){

  foo<<<1,1>>>();
  cudaDeviceSynchronize();}

$ nvcc -o t53 t53.cu
$ cuda-memcheck ./t53
========= CUDA-MEMCHECK
b[0] = 1
b[1] = 2
b[2] = 3
b[3] = 4
b[4] = 5
b[5] = 6
b[6] = 7
b[7] = 8
b[8] = 9
b[9] = 10
========= ERROR SUMMARY: 0 errors
$

For this particular use case, I would say that thrust::seq is a more sensible execution policy than thrust::device, but according to my test above, thrust::device behaves similarly in this case.
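
For reference, the kernel with thrust::seq would look something like this (a minimal sketch along the lines of my test above, not your actual code):

#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <cstdio>

__global__ void foo()
{
  double a[123];
  int b[123];
  for (int i = 0; i < 123; i++) {
    b[i] = 123 - i;
    a[i] = 123 - i;
  }
  // thrust::seq always runs sequentially in the calling thread,
  // so it can operate directly on the thread's stack-allocated arrays
  thrust::sort_by_key(thrust::seq, a, a + 123, b);
  for (int i = 0; i < 10; i++) printf("b[%d] = %d\n", i, b[i]);
}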

Thanks a lot for the working example. The code you provided worked perfectly fine. I thought my problem would boil down to something similar, but apparently not. I hadn’t thought about thrust::seq, and you’re right, that’s probably what I was looking for. However, I noticed something pretty strange:

========= Invalid __global__ read of size 8
=========     at 0x00000100 in void cub::DeviceRadixSortSingleTileKernel<cub::DeviceRadixSortPolicy<double, int, int>::Policy800, bool=0, double, int, int>(int const *, cub::DeviceRadixSortSingleTileKernel<cub::DeviceRadixSortPolicy<double, int, int>::Policy800, bool=0, double, int, int>*, cub::DeviceRadixSortPolicy<double, int, int>::Policy800 const *, cub::DeviceRadixSortSingleTileKernel<cub::DeviceRadixSortPolicy<double, int, int>::Policy800, bool=0, double, int, int>**, bool=0, int, int)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x7f97f3fffb40 is out of bounds

In the actual code this is triggered by thrust::sort_by_key(), but only when I set the execution policy to thrust::device. When I use thrust::seq everything is fine and I don’t get the error anymore, so I should be okay, but I didn’t think it was possible for an error like this to disappear just by changing the execution policy.

thrust::device, when used in device code, has the possibility to invoke CUDA Dynamic Parallelism (CDP), on a supported GPU with a supported build environment (e.g. with relocatable device code and device linking).

If you are building that way, you may run into an issue, because local memory cannot be a data source for a CDP child kernel. In that case I would expect an error, perhaps like the one you are reporting. But there is no way to tell conclusively based on scant information and a code snippet.
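
If you actually wanted the CDP path, one possible workaround (a sketch only, not tested against your code) is to keep the data out of local memory, for example by using the in-kernel heap, which lives in global memory and is therefore visible to a child kernel:

#include <thrust/sort.h>
#include <thrust/execution_policy.h>

__global__ void foo()
{
  // in-kernel new/malloc allocates from the device heap, which is in
  // global memory, unlike stack-allocated arrays, which live in local memory
  double *a = new double[123];
  int    *b = new int[123];

  for (int i = 0; i < 123; i++) {
    b[i] = 123 - i;
    a[i] = 123 - i;
  }

  thrust::sort_by_key(thrust::device, a, a + 123, b);

  // ... use the sorted data ...

  delete [] a;
  delete [] b;
}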

thrust::seq has no such possibility to invoke CDP. And furthermore, the build command I used above does not provide a proper build environment for CDP, so in my case thrust::device cannot invoke CDP either.
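
For comparison, a CDP-capable build would need relocatable device code and the device runtime library, something along these lines (the arch flag here is just an example for a Turing GPU):

$ nvcc -arch=sm_75 -rdc=true -o t53 t53.cu -lcudadevrt

Without those options, thrust::device called from device code cannot launch a child kernel and effectively degrades to a sequential, in-thread sort, which is why my example above ran cleanly.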

I did some further testing and you’re right: I’m building with CMake with CDP enabled (relocatable device code / separable compilation), so that explains why the standalone example works fine while my real code does not. The fact that thrust::device falls back to sequential execution when CDP isn’t available, and that local memory cannot be a data source for a CDP kernel when it is, explains a lot of the inconsistent results I was getting. Thanks for the help, this cleared things up a lot!