Problem Explanation
Hello there! I am trying to use the cuPCL repository (GitHub - NVIDIA-AI-IOT/cuPCL: A project demonstrating how to use the libs of cuPCL) to preprocess a PointCloud with a voxel-downsampling filter before running the provided clusterer. The program runs smoothly without the voxel downsampling, but the problem appears as soon as an instance of the filter is created alongside the clusterer, as shown below:
cudaExtractCluster cudaec(stream);
cudaFilter filterTest(stream);
or
cudaFilter filterTest(stream2); (using the same or a different stream makes no difference)
Using just one of them works, but creating both produces:
Cuda failure: an illegal memory access was encountered at line 138 in file cudaFilter.cpp error status: 700
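For context, a stripped-down version of the setup looks roughly like this (the header names and stream setup follow the cuPCL samples and may differ from your local copy; the actual filter/cluster calls are elided):

#include <cuda_runtime.h>
#include "cudaFilter.h"   // from cuPCL/cuFilter
#include "cudaCluster.h"  // from cuPCL/cuCluster

int main() {
    cudaStream_t stream = NULL;
    cudaStreamCreate(&stream);

    // Either instance alone works; creating both triggers the
    // illegal memory access on the first kernel launch.
    cudaExtractCluster cudaec(stream);
    cudaFilter filterTest(stream);

    // ... set filter params, run the voxel filter, then cluster ...

    cudaStreamDestroy(stream);
    return 0;
}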
After some debugging with CUDA-GDB and CUDA-MEMCHECK I arrived at the following findings, but I am not sure they can be acted on, since the classes ship as precompiled .so files:
- Both classes invoke cudaFillVoxelGirdKernel, and the error occurs at the first kernel launch of whichever call runs first:
Thread 1 "collision_avoid" received signal CUDA_EXCEPTION_1, Lane Illegal Address.
[Switching focus to CUDA kernel 0, grid 6, block (3,0,0), thread (160,0,0), device 0, sm 6, warp 4, lane 0]
0x0000555555d50eb0 in cudaFillVoxelGirdKernel(float4*, int4*, int4*, float4*, unsigned int, float, float, float) ()
- The thread is trying to write 4 bytes to a global memory address (CUDA-MEMCHECK):
Invalid __global__ write of size 4
- And from debugging:
Illegal access to address (@global)0x8007b0800c60 detected
(cuda-gdb) print *0x8007b0800c60
Error: Failed to read local memory at address 0x8007b0800c60 on device 0 sm 0 warp 9 lane 0, error=CUDBG_ERROR_INVALID_MEMORY_ACCESS(0x8).
What I do not understand is that, from the thread's scope, the address is treated as a local address, while it actually appears to be a global one. Note that cudaMallocManaged has been used for the memory transfers (UVM), and even switching to explicit memory transfers did not solve the issue.
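For completeness, the two allocation paths I compared look roughly like this (nCount and hostPoints are placeholders; the float4-per-point layout and the attach flags follow the cuPCL samples):

// UVM path: one pointer visible to both host and device.
float *points = NULL;
cudaMallocManaged(&points, sizeof(float) * 4 * nCount, cudaMemAttachHost);
cudaStreamAttachMemAsync(stream, points);
// ... fill 'points' on the host, then hand it to the filter ...

// Explicit path: separate device buffer plus an async copy.
float *dPoints = NULL;
cudaMalloc(&dPoints, sizeof(float) * 4 * nCount);
cudaMemcpyAsync(dPoints, hostPoints, sizeof(float) * 4 * nCount,
                cudaMemcpyHostToDevice, stream);
cudaStreamSynchronize(stream);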
Another attempt to solve the issue was to constrain all CUDA computations to the device limits by querying and re-applying them explicitly, as follows:
#include <cuda_runtime.h>
#include <iostream>

// Query each limit, print it, and write the same value back
// (this re-applies the current defaults rather than changing them).
size_t limit = 0;
cudaDeviceGetLimit(&limit, cudaLimitStackSize);
std::cout << "Stack limit is: " << limit << std::endl;
cudaDeviceSetLimit(cudaLimitStackSize, limit);
cudaDeviceGetLimit(&limit, cudaLimitPrintfFifoSize);
std::cout << "cudaLimitPrintfFifoSize limit is: " << limit << std::endl;
cudaDeviceSetLimit(cudaLimitPrintfFifoSize, limit);
cudaDeviceGetLimit(&limit, cudaLimitMallocHeapSize);
std::cout << "cudaLimitMallocHeapSize limit is: " << limit << std::endl;
cudaDeviceSetLimit(cudaLimitMallocHeapSize, limit);
cudaDeviceGetLimit(&limit, cudaLimitDevRuntimeSyncDepth);
std::cout << "cudaLimitDevRuntimeSyncDepth limit is: " << limit << std::endl;
cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, limit);
cudaDeviceGetLimit(&limit, cudaLimitDevRuntimePendingLaunchCount);
std::cout << "cudaLimitDevRuntimePendingLaunchCount limit is: " << limit << std::endl;
cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, limit);
cudaDeviceGetLimit(&limit, cudaLimitMaxL2FetchGranularity);
std::cout << "cudaLimitMaxL2FetchGranularity limit is: " << limit << std::endl;
cudaDeviceSetLimit(cudaLimitMaxL2FetchGranularity, limit);
But this yielded no changes.
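To at least pin down which instantiation corrupts memory, the two constructors can be bracketed with synchronous error checks, so the failure is reported at its true origin instead of at line 138 of cudaFilter.cpp. A minimal sketch (CUDA_CHECK is my own helper macro, not part of cuPCL):

#include <cstdio>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess)                                       \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
    } while (0)

cudaExtractCluster cudaec(stream);
CUDA_CHECK(cudaGetLastError());       // catch launch-time errors
CUDA_CHECK(cudaDeviceSynchronize());  // flush pending async errors

cudaFilter filterTest(stream);
CUDA_CHECK(cudaGetLastError());
CUDA_CHECK(cudaDeviceSynchronize());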
Device Info
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03 Driver Version: 470.161.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| N/A 56C P8 18W / N/A | 123MiB / 7982MiB | 32% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Dev 0 (*): PCI Bus/Dev ID 01:00.0, NVIDIA GeForce RTX 2080 Super with Max-Q Design (TU104-A)
SM Type sm_75, 48 SMs, 32 Warps/SM, 32 Lanes/Warp, 256 Max Regs/Lane, Active SMs Mask 0x00000000000000000000ffffffffffff
Using ROS Noetic and Ubuntu 20.04.