CUDA "Warp Out of Address" when instantiating cudaExtractCluster and cudaFilter objects (cuPCL)

Problem Explaination

Hello there ! I am trying to use the cuPCL repository: GitHub - NVIDIA-AI-IOT/cuPCL: A project demonstrating how to use the libs of cuPCL. such to preprocess the PointCloud by a Voxel Downsampling Filter prior to using a the defined Clusterer. The program runs smoothly without the Voxel Downsampling , but the problem comes when only making an Instance of the filter as shown below:

cudaExtractCluster cudaec(stream);
cudaFilter filterTest(stream);
cudaFilter filterTest(stream2); (Same or different streams does not matter)

So just using one works, but both produces: Cuda failure: an illegal memory access was encountered at line 138 in file cudaFilter.cpp error status: 700

After some trials with Debugging with CUDA-GDB and CUDA MEMCHECK I came to the following results but do not quite sure if they can be solved as the classes are implemented in a precompiled .so files:

  • Both classes invoke the cudaFillVoxelGirdKernel, and the error occurs on the Kernel Launch of the first function call that invokes the Kernel Launch :
Thread 1 "collision_avoid" received signal CUDA_EXCEPTION_1, Lane Illegal Address.
[Switching focus to CUDA kernel 0, grid 6, block (3,0,0), thread (160,0,0), device 0, sm 6, warp 4, lane 0]
0x0000555555d50eb0 in cudaFillVoxelGirdKernel(float4*, int4*, int4*, float4*, unsigned int, float, float, float) ()
  • The Thread is trying to write 4 bytes into some Global Memory address (CUDA MEMCHECK):
Invalid __global__ write of size 4
  • And from debugging:
Illegal access to address (@global)0x8007b0800c60 detected
(cuda-gdb) print *0x8007b0800c60
Error: Failed to read local memory at address 0x8007b0800c60 on device 0 sm 0 warp 9 lane 0, error=CUDBG_ERROR_INVALID_MEMORY_ACCESS(0x8).

What I do not understand is that from the Thread’s scope the address is treated as a local address , but actually it seems to be a global one. Note that for memory transfer cudaMemMallocManaged has been used (UVM), and even using explicit memory transfers did not solve the issue.

Other efforts to solve the issue was to limit all CUDA computations to match the Device limits as follows:

  size_t limit = 0;
  cudaDeviceGetLimit(&limit, cudaLimitStackSize);
  std::cout << "Stack limit is: " << limit << std::endl;
  cudaDeviceSetLimit(cudaLimitStackSize, limit);

  cudaDeviceGetLimit(&limit, cudaLimitPrintfFifoSize);
  std::cout << "cudaLimitPrintfFifoSize limit is: " << limit << std::endl;
  cudaDeviceSetLimit(cudaLimitPrintfFifoSize, limit);

  cudaDeviceGetLimit(&limit, cudaLimitMallocHeapSize);
  std::cout << "cudaLimitMallocHeapSize limit is: " << limit << std::endl;
  cudaDeviceSetLimit(cudaLimitMallocHeapSize, limit);

  cudaDeviceGetLimit(&limit, cudaLimitDevRuntimeSyncDepth);
  std::cout << "cudaLimitDevRuntimeSyncDepth limit is: " << limit << std::endl;
  cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, limit);

  cudaDeviceGetLimit(&limit, cudaLimitDevRuntimePendingLaunchCount);
  std::cout << "cudaLimitDevRuntimePendingLaunchCount limit is: " << limit << std::endl;
  cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, limit);

  cudaDeviceGetLimit(&limit, cudaLimitMaxL2FetchGranularity);
  std::cout << "cudaLimitMaxL2FetchGranularity limit is: " << limit << std::endl;
  cudaDeviceSetLimit(cudaLimitMaxL2FetchGranularity, limit);

But not changes have been yielded.

Device Info

| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| N/A   56C    P8    18W /  N/A |    123MiB /  7982MiB |     32%      Default |
|                               |                      |                  N/A |
  Dev PCI Bus/Dev ID  Name Description                                   SM Type  
*   0  01:00.0        NVIDIA GeForce RTX 2080 Super with Max-Q Design     TU104-A   
SMs    Warps/SM Lanes/Warp Max Regs/Lane    Active SMs Mask 
sm_75  48       32         32           256 0x00000000000000000000ffffffffffff

Using Ros Noetic and Ubuntu 20.04