Problem Explanation
Hello there! I am trying to use the cuPCL repository (GitHub - NVIDIA-AI-IOT/cuPCL: A project demonstrating how to use the libs of cuPCL) to preprocess a PointCloud with a voxel-downsampling filter before running the provided clusterer. The program runs smoothly without the voxel downsampling, but the problem appears as soon as an instance of the filter is created alongside the clusterer, as shown below:
cudaExtractCluster cudaec(stream);
cudaFilter filterTest(stream);
or
cudaFilter filterTest(stream2); (using the same or a different stream makes no difference)
Using just one of them works, but creating both produces:
Cuda failure: an illegal memory access was encountered at line 138 in file cudaFilter.cpp error status: 700
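For context, a stripped-down version of the setup looks roughly like this (the header names and stream setup follow the cuPCL samples and may differ from your local copy; the actual filter/cluster calls are elided):

#include <cuda_runtime.h>
#include "cudaFilter.h"   // from cuPCL/cuFilter
#include "cudaCluster.h"  // from cuPCL/cuCluster

int main() {
    cudaStream_t stream = NULL;
    cudaStreamCreate(&stream);

    // Either instance alone works; creating both triggers the
    // illegal memory access on the first kernel launch.
    cudaExtractCluster cudaec(stream);
    cudaFilter filterTest(stream);

    // ... set filter params, run the voxel filter, then cluster ...

    cudaStreamDestroy(stream);
    return 0;
}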
After some debugging with CUDA-GDB and CUDA-MEMCHECK I arrived at the following findings, but I am not sure they can be acted on, since the classes ship as precompiled .so files:
- Both classes invoke cudaFillVoxelGirdKernel, and the error occurs at the first kernel launch of whichever call runs first:
Thread 1 "collision_avoid" received signal CUDA_EXCEPTION_1, Lane Illegal Address.
[Switching focus to CUDA kernel 0, grid 6, block (3,0,0), thread (160,0,0), device 0, sm 6, warp 4, lane 0]
0x0000555555d50eb0 in cudaFillVoxelGirdKernel(float4*, int4*, int4*, float4*, unsigned int, float, float, float) ()
- The thread is trying to write 4 bytes to a global memory address (CUDA-MEMCHECK):
Invalid __global__ write of size 4
- And from debugging:
Illegal access to address (@global)0x8007b0800c60 detected
(cuda-gdb) print *0x8007b0800c60
Error: Failed to read local memory at address 0x8007b0800c60 on device 0 sm 0 warp 9 lane 0, error=CUDBG_ERROR_INVALID_MEMORY_ACCESS(0x8).
What I do not understand is that, from the thread's scope, the address is treated as a local address, while it actually appears to be a global one. Note that cudaMallocManaged has been used for the memory transfers (UVM), and even switching to explicit memory transfers did not solve the issue.
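For completeness, the two allocation paths I compared look roughly like this (nCount and hostPoints are placeholders; the float4-per-point layout and the attach flags follow the cuPCL samples):

// UVM path: one pointer visible to both host and device.
float *points = NULL;
cudaMallocManaged(&points, sizeof(float) * 4 * nCount, cudaMemAttachHost);
cudaStreamAttachMemAsync(stream, points);
// ... fill 'points' on the host, then hand it to the filter ...

// Explicit path: separate device buffer plus an async copy.
float *dPoints = NULL;
cudaMalloc(&dPoints, sizeof(float) * 4 * nCount);
cudaMemcpyAsync(dPoints, hostPoints, sizeof(float) * 4 * nCount,
                cudaMemcpyHostToDevice, stream);
cudaStreamSynchronize(stream);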
Another attempt to solve the issue was to constrain all CUDA computations to the device limits by querying and re-applying them explicitly, as follows:
#include <cuda_runtime.h>
#include <iostream>

// Query each limit, print it, and write the same value back
// (this re-applies the current defaults rather than changing them).
size_t limit = 0;
cudaDeviceGetLimit(&limit, cudaLimitStackSize);
std::cout << "Stack limit is: " << limit << std::endl;
cudaDeviceSetLimit(cudaLimitStackSize, limit);
cudaDeviceGetLimit(&limit, cudaLimitPrintfFifoSize);
std::cout << "cudaLimitPrintfFifoSize limit is: " << limit << std::endl;
cudaDeviceSetLimit(cudaLimitPrintfFifoSize, limit);
cudaDeviceGetLimit(&limit, cudaLimitMallocHeapSize);
std::cout << "cudaLimitMallocHeapSize limit is: " << limit << std::endl;
cudaDeviceSetLimit(cudaLimitMallocHeapSize, limit);
cudaDeviceGetLimit(&limit, cudaLimitDevRuntimeSyncDepth);
std::cout << "cudaLimitDevRuntimeSyncDepth limit is: " << limit << std::endl;
cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, limit);
cudaDeviceGetLimit(&limit, cudaLimitDevRuntimePendingLaunchCount);
std::cout << "cudaLimitDevRuntimePendingLaunchCount limit is: " << limit << std::endl;
cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, limit);
cudaDeviceGetLimit(&limit, cudaLimitMaxL2FetchGranularity);
std::cout << "cudaLimitMaxL2FetchGranularity limit is: " << limit << std::endl;
cudaDeviceSetLimit(cudaLimitMaxL2FetchGranularity, limit);
But this yielded no changes.
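To at least pin down which instantiation corrupts memory, the two constructors can be bracketed with synchronous error checks, so the failure is reported at its true origin instead of at line 138 of cudaFilter.cpp. A minimal sketch (CUDA_CHECK is my own helper macro, not part of cuPCL):

#include <cstdio>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess)                                       \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
    } while (0)

cudaExtractCluster cudaec(stream);
CUDA_CHECK(cudaGetLastError());       // catch launch-time errors
CUDA_CHECK(cudaDeviceSynchronize());  // flush pending async errors

cudaFilter filterTest(stream);
CUDA_CHECK(cudaGetLastError());
CUDA_CHECK(cudaDeviceSynchronize());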
Device Info
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03 Driver Version: 470.161.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| N/A 56C P8 18W / N/A | 123MiB / 7982MiB | 32% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Dev 0 (*): PCI Bus/Dev ID 01:00.0, NVIDIA GeForce RTX 2080 Super with Max-Q Design (TU104-A)
SM Type sm_75, 48 SMs, 32 Warps/SM, 32 Lanes/Warp, 256 Max Regs/Lane, Active SMs Mask 0x00000000000000000000ffffffffffff
Using ROS Noetic and Ubuntu 20.04.